Multimodal Residual Learning for Visual QA

Authors

  • Jin-Hwa Kim
  • Sang-Woo Lee
  • Dong-Hyun Kwak
  • Min-Oh Heo
  • Jeonghee Kim
  • Jung-Woo Ha
  • Byoung-Tak Zhang
Abstract

Deep neural networks continue to advance the state of the art in image recognition with a variety of methods. However, applications of these methods to multimodal settings remain limited. We present Multimodal Residual Networks (MRN) for the multimodal residual learning of visual question-answering, which extends the idea of deep residual learning. Unlike deep residual learning, MRN effectively learns the joint representation from vision and language information. The main idea is to use element-wise multiplication for the joint residual mappings, exploiting the residual learning of attentional models in recent studies. Building on this study, we explore various alternative models introduced by multimodality. We achieve state-of-the-art results on the Visual QA dataset for both the Open-Ended and Multiple-Choice tasks. Moreover, we introduce a novel method to visualize the attention effect of the joint representations for each learning block using the back-propagation algorithm, even though the visual features are collapsed without spatial information.
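A minimal sketch of one such learning block, assuming a PyTorch implementation, might look like the following; the layer widths, the tanh nonlinearities, the two-layer visual branch, and the linearly mapped shortcut are illustrative assumptions, not details confirmed by this abstract:

```python
import torch
import torch.nn as nn

class MRNBlock(nn.Module):
    """One multimodal residual learning block (illustrative sketch).

    The joint residual mapping F(q, v) fuses the question and visual
    branches by element-wise multiplication; the block output adds a
    linearly mapped shortcut of the question representation.
    """

    def __init__(self, dim: int = 1200):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)    # question branch of F
        self.v_proj1 = nn.Linear(dim, dim)   # visual branch of F (two layers)
        self.v_proj2 = nn.Linear(dim, dim)
        self.shortcut = nn.Linear(dim, dim)  # mapped residual shortcut

    def forward(self, q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        q_branch = torch.tanh(self.q_proj(q))
        v_branch = torch.tanh(self.v_proj2(torch.tanh(self.v_proj1(v))))
        f = q_branch * v_branch              # element-wise joint mapping
        return self.shortcut(q) + f          # residual connection

# Usage: stack blocks; the visual feature is a single collapsed vector.
blocks = nn.ModuleList([MRNBlock() for _ in range(3)])
q = torch.randn(8, 1200)  # question embedding (e.g., from an RNN)
v = torch.randn(8, 1200)  # collapsed visual feature, no spatial grid
h = q
for block in blocks:
    h = block(h, v)        # h feeds an answer classifier downstream
```

Stacking several such blocks follows the same pattern as deep residual learning, except that the residual mapping takes both modalities while the shortcut carries only the question representation.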

Related Articles

A Critical Visual Analysis of Gender Representation of ELT Materials from a Multimodal Perspective

This content analysis study, employing a multimodal perspective and critical visual analysis, set out to analyze gender representation in the Top Notch series, one of the most widely used ELT textbook series in Iran. For this purpose, six images were selected from the series and analyzed in terms of 'representational', 'interactive', and 'compositional' modes of meaning. The results indicated that there are...

A Social Semiotic Analysis of Social Actors in English-Learning Software Applications

This study drew upon Kress and Van Leeuwen's (2006 [1996]) visual grammar and Van Leeuwen's (2008) social semiotic model to interrogate the ways in which social actors of different races are visually and textually represented in four award-winning English-learning software packages. The analysis was based on narrative actional/reactional processes at the ideational level; mood, perspective, ...

Multimodal follow-up questions to multimodal answers in a QA system

We are developing a dialogue manager (DM) for a multimodal interactive Question Answering (QA) system. Our QA system presents answers using text and pictures, and the user may pose follow-up questions using text or speech, while indicating screen elements with the mouse. We developed a corpus of multimodal follow-up questions for this system. This paper describes a detailed analysis of this cor...

Learning to Answer Questions from Image Using Convolutional Neural Network

In this paper, we propose to employ a convolutional neural network (CNN) for learning to answer questions about an image. Our proposed CNN provides an end-to-end framework for learning not only the image representation and the composition model for the question, but also the intermodal interaction between the image and the question, for the generation of the answer. More specifically, the proposed model cons...
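As a rough illustration of such an end-to-end pipeline (not the authors' implementation; the layer sizes, the multiplicative fusion, and the use of pre-extracted 2048-d image features are all assumptions), a sketch could be:

```python
import torch
import torch.nn as nn

class CNNQAModel(nn.Module):
    """Sketch of a CNN-based VQA model: a convolutional composition
    model over question word embeddings, a projection of image
    features, a multiplicative intermodal interaction, and an answer
    classifier. All sizes are illustrative."""

    def __init__(self, vocab: int = 10000, emb: int = 300,
                 dim: int = 512, num_answers: int = 1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # Composition model for the question: 1-D convolution over words.
        self.q_conv = nn.Sequential(
            nn.Conv1d(emb, dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.img_proj = nn.Linear(2048, dim)  # assumes pre-extracted CNN features
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, img_feat: torch.Tensor,
                question: torch.Tensor) -> torch.Tensor:
        q = self.q_conv(self.embed(question).transpose(1, 2)).squeeze(-1)
        v = torch.relu(self.img_proj(img_feat))
        joint = q * v  # intermodal interaction between image and question
        return self.classifier(joint)

# Usage: batch of 8 images (2048-d features) and 20-token questions.
model = CNNQAModel()
logits = model(torch.randn(8, 2048), torch.randint(0, 10000, (8, 20)))
```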

A Multimodal Discourse Analysis of Some Visual Images in the Political Rally Discourse of 2011 Electioneering Campaigns in Southwestern Nigeria

This paper presented a multimodal discourse analysis of some visual images in the political rally discourse of the 2011 electioneering campaigns in Southwestern Nigeria. The data comprised purposively selected political visual artefacts from political rallies across the six Southwestern states in Nigeria (Osun, Oyo, Ondo, Ekiti, Ogun, and Lagos). The data were analyzed using Halliday's (1985) syste...

Publication date: 2016